AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Kuo, En-Ya, Motsch, Sebastien

EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

arXiv.org Machine LearningMar-17-2026

Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2603.13566

Country:

North America > United States > Arizona > Maricopa County > Tempe (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Industry: Law Enforcement & Public Safety > Fraud (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Neural Information Processing SystemsFeb-8-2026, 03:07:00 GMT

Appendix A Proofs

This view of the data generating process is also adopted and promoted by popular existing works.

artificial intelligence, machine learning, nullnull, (18 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Neural Information Processing SystemsOct-2-2025, 23:26:44 GMT

6fb52e71b837628ac16539c1ff911667-AuthorFeedback.pdf

artificial intelligence, dim supervised mean 0, st-dim, (10 more...)

Technology: Information Technology > Artificial Intelligence (0.33)

Rožanec, Jože M., Žezlin, Tina, Vasiliu, Laurentiu, Mladenić, Dunja, Prodan, Radu, Roman, Dumitru

Fiaingen: A financial time series generative method matching real-world data quality

arXiv.org Artificial IntelligenceOct-2-2025

Data is vital in enabling machine learning models to advance research and practical applications in finance, where accurate and robust models are essential for investment and trading decision-making. However, real-world data is limited despite its quantity, quality, and variety. The data shortage of various financial assets directly hinders the performance of machine learning models designed to trade and invest in these assets. Generative methods can mitigate this shortage. In this paper, we introduce a set of novel techniques for time series data generation (we name them Fiaingen) and assess their performance across three criteria: (a) overlap of real-world and synthetic data on a reduced dimensionality space, (b) performance on downstream machine learning tasks, and (c) runtime performance. Our experiments demonstrate that the methods achieve state-of-the-art performance across the three criteria listed above. Synthetic data generated with Fiaingen methods more closely mirrors the original time series data while keeping data generation time close to seconds - ensuring the scalability of the proposed approach. Furthermore, models trained on it achieve performance close to those trained with real-world data.

artificial intelligence, deep learning, machine learning, (18 more...)

2510.01169

Country:

Europe (1.00)
North America > United States (0.28)

Genre: Research Report > Promising Solution (0.66)

Industry:

Banking & Finance > Trading (1.00)
Information Technology (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial IntelligenceSep-25-2025

Self-Alignment Learning to Improve Myocardial Infarction Detection from Single-Lead ECG

Jin, Jiarui, Fang, Xiaocheng, Wang, Haoyu, Li, Jun, Liu, Che, Xie, Donglin, Li, Hongyan, Hong, Shenda

Myocardial infarction is a critical manifestation of coronary artery disease, yet detecting it from single-lead electrocardiogram (ECG) remains challenging due to limited spatial information. An intuitive idea is to convert single-lead into multiple-lead ECG for classification by pre-trained models, but generative methods optimized at the signal level in most cases leave a large latent space gap, ultimately degrading diagnostic performance. This naturally raises the question of whether latent space alignment could help. However, most prior ECG alignment methods focus on learning transformation invariance, which mismatches the goal of single-lead detection. To address this issue, we propose SelfMIS, a simple yet effective alignment learning framework to improve myocardial infarction detection from single-lead ECG. Discarding manual data augmentations, SelfMIS employs a self-cutting strategy to pair multiple-lead ECG with their corresponding single-lead segments and directly align them in the latent space. This design shifts the learning objective from pursuing transformation invariance to enriching the single-lead representation, explicitly driving the single-lead ECG encoder to learn a representation capable of inferring global cardiac context from the local signal. Experimentally, SelfMIS achieves superior performance over baseline models across nine myocardial infarction types while maintaining a simpler architecture and lower computational overhead, thereby substantiating the efficacy of direct latent space alignment. Our code and checkpoint will be publicly available after acceptance.

artificial intelligence, ecg encoder, machine learning, (10 more...)

2509.19397

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.66)

Amaranath, Pracheta, Muralikrishnan, Vinitra, Sharma, Amit, Jensen, David D.

Improving Generative Methods for Causal Evaluation via Simulation-Based Inference

arXiv.org Artificial IntelligenceSep-4-2025

Generating synthetic datasets that accurately reflect real-world observational data is critical for evaluating causal estimators, but remains a challenging task. Existing generative methods offer a solution by producing synthetic datasets anchored in the observed data (source data) while allowing variation in key parameters such as the treatment effect and amount of confounding bias. However, existing methods typically require users to provide point estimates of such parameters (rather than distributions) and fixed estimates (rather than estimates that can be improved with reference to the source data). This denies users the ability to express uncertainty over parameter values and removes the potential for posterior inference, potentially leading to unreliable estimator comparisons. We introduce simulation-based inference for causal evaluation (SBICE), a framework that models generative parameters as uncertain and infers their posterior distribution given a source dataset. Leveraging techniques in simulation-based inference, SBICE identifies parameter configurations that produce synthetic datasets closely aligned with the source data distribution. Empirical results demonstrate that SBICE improves the reliability of estimator evaluations by generating more realistic datasets, which supports a robust and data-consistent approach to causal benchmarking under uncertainty.

artificial intelligence, bayesian inference, machine learning, (15 more...)

2509.02892

Country: North America > United States > Massachusetts (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Cestero, Julen, Quartulli, Marco, Restelli, Marcello

Building surrogate models using trajectories of agents trained by Reinforcement Learning

arXiv.org Artificial IntelligenceSep-3-2025

Sample efficiency in the face of computationally expensive simulations is a common concern in surrogate modeling. Current strategies to minimize the number of samples needed are not as effective in simulated environments with wide state spaces. As a response to this challenge, we propose a novel method to efficiently sample simulated deterministic environments by using policies trained by Reinforcement Learning. We provide an extensive analysis of these surrogate-building strategies with respect to Latin-Hypercube sampling or Active Learning and Kriging, cross-validating performances with all sampled datasets. The analysis shows that a mixed dataset that includes samples acquired by random agents, expert agents, and agents trained to explore the regions of maximum entropy of the state transition distribution provides the best scores through all datasets, which is crucial for a meaningful state space representation. We conclude that the proposed method improves the state-of-the-art and clears the path to enable the application of surrogate-aided Reinforcement Learning policy optimization strategies on complex simulators.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

doi: 10.1007/978-3-031-72341-4_23

2509.01285

Country: Europe (0.46)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.46)

Industry: Energy (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Neural Information Processing SystemsMay-27-2025, 07:18:10 GMT

TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

classification performance, distinct class-specific energy-based model, tabular data augmentation method, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Croitoru, Florinel-Alin, Hondru, Vlad, Popescu, Marius, Ionescu, Radu Tudor, Khan, Fahad Shahbaz, Shah, Mubarak

MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

arXiv.org Artificial IntelligenceMay-19-2025

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.

artificial intelligence, machine learning, proceedings, (15 more...)

2505.11109

Country:

Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.04)
Europe > Sweden > Östergötland County > Linköping (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)